ABSTRACT

We consider information retrieval when the data, for instance multimedia, is computationally
expensive to fetch. Our approach uses "information filters" to considerably narrow
the universe of possibilities before retrieval. Then decisions must be made about the necessity,
order, and concurrent processing of proposed filters (an "execution plan"). We develop
simple polynomial-time local criteria for optimal execution plans, and show that most forms
of concurrency are suboptimal with information filters. Although the general problem of
finding an optimal execution plan is likely exponential in the number of filters, we show
experimentally that our local optimality criteria, used in a polynomial-time algorithm, nearly
always find the global optimum with 15 filters or less, a sufficient number of filters for most
applications. Our methods do not require special hardware and avoid the high processor idleness
that is characteristic of massive-parallelism solutions to this problem. We apply our
ideas to an important application, information retrieval of captioned data using natural-language
understanding, a problem for which the natural-language processing can be the
bottleneck if not implemented well.
1. Introduction

Information retrieval of multimedia data differs from traditional information retrieval in that the data can be so
much costlier to fetch because it is so much larger. Thus high recall (retrieval of all relevant data in the database)
and high precision (yield of relevant data in the fetched set) are more important than in citation retrieval.
So some effort is needed before data fetch, and it is important to find the best way to do it. The best ways are
not necessarily the same as the best ways for traditional database systems, as discussed in work such as
[3, 5, 12, 19].

We are exploring the concept of "information filters" [2] to improve query performance. These processes take
as input a set of media-object pointers, and return the subset that pass some necessary but not sufficient conditions
for a data match. Different filters can work on separate parts of a query, or on separate media if each
datum is multimedia (as when pictures have associated text captions or audio). We assume here that filters err
only on the side of caution, so that they never exclude relevant media objects. Even though detailed examination
of the data would subsume their results, information filters can be cost-effective if their cost is significantly
less than a data match. But not all filters are cost-effective, nor all ways of using them, and we need to develop
a theory for using them.

Signature matching [4, 8, 10, 11] is a special case of information filtering that has been fruitfully applied first to
text data and then to multimedia data. It extracts the key words in text, the key shapes in pictures, or the key
sounds in audio, and hashes them into a "signature table". At query time, query words or features are also
hashed into the signature table. A hash hit on any word or feature is a necessary but not sufficient condition for
an exact match between the query and some multimedia datum that was hashed there. The signature file can be
stored in main memory, and searching it can be considerably faster than searching a secondary-storage index to
the data. Thus signature matching is a special case of information filtering as we defined it above. Signature
matching does usually require, however, that each media object be described manually, that the description
anticipate most future queries (which is difficult, as discussed in [20]), and that the match be exact.
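
The signature-table idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation of any cited system: the bucket count, the CRC-based hash, the helper names, and the three sample descriptions are all hypothetical, and the query is assumed to be conjunctive (every query word must get a hash hit).

```python
import zlib

NUM_BUCKETS = 64  # illustrative table size (assumption)


def bucket(word):
    """Deterministic bucket index for a keyword."""
    return zlib.crc32(word.encode("utf-8")) % NUM_BUCKETS


def build_signature_table(descriptions):
    """Map each bucket to the ids of objects with a keyword hashing there."""
    table = {}
    for obj_id, keywords in descriptions.items():
        for word in keywords:
            table.setdefault(bucket(word), set()).add(obj_id)
    return table


def signature_filter(table, query_words):
    """Ids that get a hash hit for every query word.

    The result is a superset of the exact matches: hash collisions can let
    irrelevant objects through, but no relevant object is excluded.
    """
    candidates = None
    for word in query_words:
        hits = table.get(bucket(word), set())
        candidates = set(hits) if candidates is None else candidates & hits
    return candidates if candidates is not None else set()


# Hypothetical manual descriptions of three pictures.
descriptions = {
    1: {"f-18", "runway"},
    2: {"f-18", "carrier"},
    3: {"runway", "truck"},
}
table = build_signature_table(descriptions)
candidates = signature_filter(table, ["f-18", "runway"])
```

Only the objects in `candidates` would then need the expensive exact match; the filter errs on the side of caution, as assumed for all filters in this paper.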

Some ways of signature matching are better than others. As an example, suppose a user wants to find a picture
of an F-18 aircraft on a runway in a database of captioned pictures, a typical query of those we have been
studying for the Photo Lab of the Naval Air Warfare Center, Weapons Division (NAWC-WD), China Lake,
California, USA [21]. This could be decomposed into four filters applied to the picture database: (1) pictures of
F-18s, (2) pictures of runways, (3) pictures of things "on" other things, and (4) pictures of F-18s on runways.
Intuitively, it would seem best to apply the first three filters in that order because "F-18" is quite specific,
"runway" less so, and "on" even less so. Intuitively, it would also seem that the third filter could be eliminated in
deference to the fourth, because almost always an aircraft is on a runway and not beside it, underneath it, or in
some other relationship to it. But we need mathematics to justify these intuitions.

The primary objective of this paper is to present a general theory of optimal information retrieval with information
filters. While the theory applies to all kinds of filters, it contains important new results about signature and
other redundant filters, which have not previously been carefully analyzed as abstracted components of an
information-retrieval system. We will use a simple model of filter cost that corresponds closely to that of most
information-retrieval implementations. We will first examine in sections 2 and 3 the most common kind of
multi-filtering, conjunctive filtering, and provide simple local optimality conditions on a conjunctive sequence.
The local optimality conditions concern interchanges of filter order, deletion of redundant filters, insertion of
redundant filters, and concurrent execution of filters. Surprisingly, we will prove that with a general cost model,
most forms of concurrency are not desirable with information filtering, since the earlier starts of the concurrent
filters do not compensate for the increased input they must handle. Section 3 will show results of experiments
confirming the value of our local optimality criteria, and in particular that a simple "greedy" algorithm based on
them has excellent average performance.

We then generalize our results to arbitrary boolean expressions involving filters in section 4. Disjunctive
sequences are just the duals of conjunctive sequences, and negations are relatively straightforward in their
optimality implications. Factoring of conjuncts over disjuncts and vice versa leads to an additional local
optimality condition, but one that we argue is unimportant in most applications.

One important application of signature matching is to supporting natural-language processing in information
retrieval. Many researchers in information retrieval have pointed out the deficiencies of raw keyword matching
(e.g. [16]), and parsing and semantic interpretation of natural-language data descriptions could be a solution.
The main obstacle is speed. Depending on the approach, the required parsing, semantic interpretation, and
semantic matching could take minutes where keyword matching requires seconds. But if we can decompose the
required processing into several filters, we may be able to rule out most potential data at an early stage, without
full natural-language processing of their data descriptions. We discuss this in section 5 of this paper, and our
theory permits us to improve the MARIE system [21], which pioneered in the systematic use of natural-language
captions on multimedia data as the primary indexing of the data.

2. General analysis of conjunctive information filtering

Suppose we have $m$ information filters through which some data items must pass; that is, each data item must
pass the test administered by each filter. Let the event of passing filter $i$ be termed $f_i$. Assume each filter
has an average cost of execution per data item of $c_i$ and an average a priori probability of passing the data item
of $p(f_i)$, where $0 < p(f_i) < 1$ to avoid considering trivial cases. Generally the $c_i$ will be execution times so as to
find the minimum execution time of a filter sequence, but our mathematics here applies to any costs. Assume
that the cost of testing whether an item passes a filter is a constant independent of the success or failure of the
test; this is true of table lookups, for instance.

If the filters are applied in sequence, the expected total cost per data item will be:

$$t_{1,m} = c_1 + c_2\,p(f_1) + c_3\,p(f_1 \wedge f_2) + \cdots + c_m\,p(f_1 \wedge f_2 \wedge \cdots \wedge f_{m-1}) \tag{1}$$

We would like to choose the filter sequence that minimizes $t_{1,m}$ for a set of $m$ filters, or know if possible
deletions of filters could improve cost. The parameters $c$ and $p$ can be estimated either from past statistics of the
filters on similar problems, or by random sampling of a small fraction of the database and applying the filters to
it.
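
As a small illustration of the expected-cost formula, the sketch below evaluates it for a given filter order. It assumes, purely for simplicity, that the filters are mutually independent so that each joint probability is a product of a priori probabilities; the function name and the numbers are hypothetical.

```python
def expected_cost(costs, pass_probs):
    """Expected per-item cost of applying filters in the given order.

    Assumes independent filters, so the probability that an item reaches
    filter k is the product of the pass probabilities of filters 1..k-1.
    """
    total = 0.0
    reach = 1.0  # probability that an item reaches the current filter
    for c, p in zip(costs, pass_probs):
        total += c * reach
        reach *= p
    return total


# Three hypothetical filters with costs 1, 2, 3 and pass probability 0.5 each:
# 1 + 2*0.5 + 3*0.25
cost = expected_cost([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Note that the pass probability of the last filter never enters the cost, exactly as in the formula.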

Problems related to finding the optimal conjunctive sequence have been examined elsewhere. In the database
literature this is the problem of restriction-order optimization based on selectivities (called a "single-variable"
optimization in [15]), but the focus there has been on solving the more critical problem of optimizing
joins and other operations that generally are far worse bottlenecks in processing time for databases. The usual
methods require search, either exhaustive or heuristic, in the space of possible rearrangements of a query expression,
rather than attempting to find general optimization criteria. Work in semantic query optimization for database
queries sometimes suggests signature-table methods [23], though it is usually concerned with application of
more complicated "integrity constraints". In artificial intelligence, [25] analyzed optimization of filter sequences
that create persistent variable bindings, a different problem but related to the one above. Work in Markovian
decision processes [17] has developed general methods for situations more complicated than sequences, but these
are not very efficient for sequences. Work on optimal decision trees assumes all $c_i$ terms are equal, which leads
to specialized algorithms. Problems of task scheduling that are related to conjunctive-sequence ordering are
generally NP-complete [9] because generally we must examine some constant fraction of all possible
sequences in order to find the optimal one. But in this paper, we will propose some quick polynomial-time criteria
that can be used to rule out all but a few possible sequences, as for instance criteria that sort the sequence.
While these criteria are not guaranteed to find the optimal solution, they usually do, as we have confirmed by
experiments, and furthermore they usually greatly improve the average-case execution time of the sequence.

2.1.  Local  criteria  for  interchange-optimality  of  the  cost  of  a  conjunctive  filter  sequence 

First, consider the effect on cost of interchanging filters in a conjunctive sequence. If a sequence is a local
optimum with respect to cost, then any such interchange must not decrease the total cost. Consider the effect of
interchanging adjacent filters $i$ and $i+1$. This certainly cannot affect the cost terms for filters before $i$, and it
cannot affect the cost terms for filters after $i+1$ because $f_1 \wedge f_2 = f_2 \wedge f_1$. So the interchange of filters $i$ and $i+1$
will not improve cost if:

$$c_i\,p(f_1 \wedge \cdots \wedge f_{i-1}) + c_{i+1}\,p(f_1 \wedge \cdots \wedge f_{i-1} \wedge f_i) \le c_{i+1}\,p(f_1 \wedge \cdots \wedge f_{i-1}) + c_i\,p(f_1 \wedge \cdots \wedge f_{i-1} \wedge f_{i+1})$$

or, after converting to conditional probabilities which represent the "value" of each filter:

$$c_i + c_{i+1}\,p(f_i \mid f_1 \wedge f_2 \wedge \cdots \wedge f_{i-1}) \le c_{i+1} + c_i\,p(f_{i+1} \mid f_1 \wedge f_2 \wedge \cdots \wedge f_{i-1})$$

We then get after rearranging:

$$\frac{c_i}{1 - p(f_i \mid f_1 \wedge f_2 \wedge \cdots \wedge f_{i-1})} \le \frac{c_{i+1}}{1 - p(f_{i+1} \mid f_1 \wedge f_2 \wedge \cdots \wedge f_{i-1})} \tag{2}$$

In other words, a locally optimal sequence must be sorted by $c/q$, where $q$ is the fraction of the items failing
this filter after passing all previous filters. We call this "interchange optimality".
Often we can assume that filter-acceptance events $f_i$ and $f_{i+1}$ are independent of all the previous
filter-acceptance events. Then the conditional probabilities become the a priori probabilities $p(f_i)$ of a random data
item passing a filter $i$, and we get (Theorem 1 of [14]):

$$\frac{c_i}{1 - p(f_i)} \le \frac{c_{i+1}}{1 - p(f_{i+1})} \tag{3}$$
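
For independent filters, the sorting criterion can be tried out directly. The sketch below orders four hypothetical filters by $c/(1-p)$ and compares the result with an exhaustive search over all orderings; the filter values and function names are illustrative assumptions, and the exhaustive search is only practical for a handful of filters.

```python
from itertools import permutations


def expected_cost(filters):
    """Expected per-item cost of a sequence of independent (cost, p) filters."""
    total, reach = 0.0, 1.0
    for cost, p in filters:
        total += cost * reach
        reach *= p
    return total


def interchange_sort(filters):
    """Order filters by increasing c / (1 - p), the independent-filter criterion."""
    return sorted(filters, key=lambda f: f[0] / (1.0 - f[1]))


# Hypothetical (cost, pass probability) pairs.
filters = [(5.0, 0.2), (1.0, 0.9), (2.0, 0.5), (0.5, 0.7)]
ordered = interchange_sort(filters)
best = min(permutations(filters), key=expected_cost)
```

With these numbers the sorted order and the exhaustive optimum have the same expected cost, which is what the criterion predicts when every filter is independent of the others.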


2.2. Entailing and entailed filters

Unfortunately, we cannot often assume independence of filter pairs because the whole idea of signature matching
is that anything that passes full matching will also pass the signature matching. Hence the conditional probability
for the signature filter passing a data item that the full-matching filter passes is 1. We will call the signature
filter the "entailed" filter, and the full-matching filter the "entailing" filter. We will assume that entailment, or
conditional dependence of the probability of filter success, is always absolute if it occurs at all. Filters usually
can be designed to accomplish this, though it should be noted that if $f_3$ entails $f_1$ and $f_3$ entails $f_2$, then $f_1$ and
$f_2$ cannot be completely independent, although they could be very near independence.

With an entailing filter $e$, $p(f_e \mid u) \ne p(f_e)$ where $u$ represents the situation of passing all the previous filters
before $e$. But we can use Bayes' Rule. Suppose $u$ can be broken into two pieces such that $u = u_1 \wedge u_2$ and $u_2$ is
the largest subset independent of $f_e$. Then:

$$p(f_e \mid u) = p(f_e \mid u_1) = p(f_e) / p(u_1) \tag{4}$$

To obtain $p(u_1)$ if all the filters in $u_1$ are independent, multiply their probabilities. Otherwise, since entailment
is absolute when it does occur, eliminate the probabilities of filters that are entailed by others in $u_1$ and then
multiply the rest.

For instance, if filter 7 entails filters 3 and 4 but is independent of four other preceding filters, and the a priori
probability of passing filter 3 is 0.4, of passing filter 4 is 0.5, of passing filter 7 is 0.1, and filter 3 is
near-independent of filter 4, $p(f_7 \mid f_1 \wedge \cdots \wedge f_6) = 0.1/(0.4 \cdot 0.5) = 0.5$. If filter 7 is then followed by a filter 8 of the same
cost but with an independent success probability of 0.6, filters 7 and 8 should not be interchanged in search of
local optimality.
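
The arithmetic of this example can be checked with a short sketch. It assumes, as the example does, that the entailed filters are (near-)independent of one another so that $p(u_1)$ is a simple product; the function name and the unit cost are illustrative.

```python
from math import prod


def conditional_pass_prob(p_entailing, entailed_probs):
    """p(f_e | u) = p(f_e) / p(u_1) for an entailing filter.

    `entailed_probs` holds the a priori pass probabilities of the preceding
    filters that f_e entails, assumed mutually independent here.
    """
    return p_entailing / prod(entailed_probs)


# Filter 7 entails filters 3 and 4.
p7 = conditional_pass_prob(0.1, [0.4, 0.5])

# Interchange criterion with equal (unit) costs for filters 7 and 8.
cost = 1.0
ratio7 = cost / (1.0 - p7)
ratio8 = cost / (1.0 - 0.6)
keep_order = ratio7 <= ratio8  # True means 7 should stay before 8
```

Because filter 7 rejects half of the items that reach it while filter 8 rejects only 40 percent, the ratio for filter 7 is the smaller one and the existing order is kept.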
Another consequence of our assumption of all-or-none dependence between filters is that $p(f_i \mid u)$ is only
different from $p(f_i)$ when it is an entailing filter, and then it is only a function of its entailed filters. But
entailed filters must precede all their entailing filters in a given filter sequence to be useful. So $p(f_i \mid u)$ will be
a constant for all sensible placements of an entailing filter $f_i$ in a filter sequence. Hence inequalities (2) and (3)
are sorting criteria for a bubble sort on the filter sequence. Hence if we bubble-sort using interchange optimality,
and the resulting sequence also obeys entailment relationships, we have found the optimal order for the
sequence in polynomial time with respect to the number of filters. If the sort result does not obey entailment
relationships, we must try something else. Section 3.1 discusses what else we can do, and gives an important
theorem for this situation.

Note that even if entailment is not all-or-none, result (2) can still allow sorting of a filter sequence if the variations
in the conditional probabilities $p(f_i \mid f_1 \wedge \cdots \wedge f_{i-1})$ are sufficiently small to assign a consistent place to
any $f_i$ in the sequence.

2.3. Deleting entailed filters

Entailed filters must come before their entailing filters or else they are useless. There is now a new possibility
for improving filter-sequence cost without changing the answers it produces: deletion of an entailed filter.

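Assume first that there is only one entailing filter and only one entailed filter. Assume the entailed (fast) filter is
$i$ and the entailing (slow) filter is $e$. As with interchange, the deletion cannot affect the cost terms before $i$. It
cannot either affect the cost terms after $e$ because filter $e$ will rule out an item regardless of how much assistance
it gets from filter $i$. So deletion of filter $i$ does not improve cost if:

$$c_i\,p(f_1 \wedge \cdots \wedge f_{i-1}) + c_{i+1}\,p(f_1 \wedge \cdots \wedge f_{i-1} \wedge f_i) + \cdots + c_e\,p(f_1 \wedge \cdots \wedge f_{e-1})$$

$$\le c_{i+1}\,p(f_1 \wedge \cdots \wedge f_{i-1}) + c_{i+2}\,p(f_1 \wedge \cdots \wedge f_{i-1} \wedge f_{i+1}) + \cdots + c_e\,p(f_1 \wedge \cdots \wedge f_{i-1} \wedge f_{i+1} \wedge \cdots \wedge f_{e-1})$$

or after introducing conditional probabilities:

$$c_i + c_{i+1}\,p(f_i \mid f_1 \wedge \cdots \wedge f_{i-1}) + c_{i+2}\,p(f_i \wedge f_{i+1} \mid f_1 \wedge \cdots \wedge f_{i-1}) + \cdots + c_e\,p(f_i \wedge \cdots \wedge f_{e-1} \mid f_1 \wedge \cdots \wedge f_{i-1}) \tag{5}$$

$$\le c_{i+1} + c_{i+2}\,p(f_{i+1} \mid f_1 \wedge \cdots \wedge f_{i-1}) + \cdots + c_e\,p(f_{i+1} \wedge \cdots \wedge f_{e-1} \mid f_1 \wedge \cdots \wedge f_{i-1})$$

We call this condition "deletion optimality".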
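
Deletion optimality can also be tested numerically by comparing the expected cost of a sequence with and without the candidate filter. The sketch below is one way to do this under the paper's all-or-none model; the dictionary-based representation, the function names, and the example filters are assumptions made for illustration, and `entails` is assumed to be transitively closed.

```python
from math import prod


def joint_prob(names, p, entails):
    """Probability that an item passes every filter in `names`.

    Filters entailed by another member of `names` are dropped, since passing
    the entailing filter already implies passing them; the remaining filters
    are assumed independent, so their probabilities are multiplied.
    """
    names = set(names)
    entailed = set()
    for n in names:
        entailed |= entails.get(n, set()) & names
    return prod(p[n] for n in names - entailed)


def sequence_cost(seq, cost, p, entails):
    """Expected per-item cost: each filter's cost times the chance of reaching it."""
    return sum(cost[f] * joint_prob(seq[:k], p, entails) for k, f in enumerate(seq))


def deletion_optimal(seq, i, cost, p, entails):
    """True if removing seq[i] does not lower the expected cost."""
    shorter = seq[:i] + seq[i + 1:]
    return sequence_cost(seq, cost, p, entails) <= sequence_cost(shorter, cost, p, entails)


# Hypothetical filters: "full" entails both "sig" and "weak".
cost = {"sig": 1.0, "weak": 5.0, "full": 20.0}
p = {"sig": 0.2, "weak": 0.9, "full": 0.05}
entails = {"full": {"sig", "weak"}}

keep_sig = deletion_optimal(["sig", "full"], 0, cost, p, entails)
keep_weak = deletion_optimal(["weak", "full"], 0, cost, p, entails)
```

Here the cheap, selective filter `sig` is worth keeping, while the costly and unselective filter `weak` is not: running it first costs more than it saves on the full match.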
If we can assume that passing of filter $f_i$ is independent of passing all other filters $f_j$ from $j=1$ to $e-1$ we get
after rearrangement:

$$\frac{c_i}{1 - p(f_i)} \le c_{i+1} + c_{i+2}\,p(f_{i+1} \mid f_1 \wedge \cdots \wedge f_{i-1}) + \cdots + c_e\,p(f_{i+1} \wedge \cdots \wedge f_{e-1} \mid f_1 \wedge \cdots \wedge f_{i-1}) \tag{6}$$

Note that the right side is just $t_{i+1,e}$ in the notation of equation (1), or the expected cost of the filter sequence
$f_{i+1}, f_{i+2}, \ldots, f_e$ by itself.

Another way to simplify (5) is to note that $p((f_i \wedge r) \mid f_1 \wedge \cdots \wedge f_{i-1}) \le p(r \mid f_1 \wedge \cdots \wedge f_{i-1})$ and we can use this to
match each pair of the last $e-i-1$ terms on the left side and the right side. Then eliminating each pair, we get a
simpler sufficient condition for (5) to be true:

$$\frac{c_i}{1 - p(f_i \mid f_1 \wedge \cdots \wedge f_{i-1})} \le c_{i+1} \tag{7}$$

Note that condition (7) for filter $i$ implies the interchange optimality condition (3) for filters $i$ and $i+1$, since
$c_{i+1} \le c_{i+1} / (1 - p(f_{i+1}))$.


2.4.  Deletion  of  more  than  one  entailed  filter 

A question arises about deletion optimality: Even if filters $i$ and $j$ are individually deletion-optimal in a particular
filter sequence, could the deletion of both of them nonetheless lower the cost? The following definitions and
Lemma will provide the criteria for such stronger forms of optimality.

A filter that is not entailed is "strongly-deletion-optimal". An entailed filter $i$ is strongly-deletion-optimal if
condition (7) holds, $c_i / (1 - p(f_i \mid f_1 \wedge \cdots \wedge f_{i-1})) \le c_{i+1}$, and if at the same time $c_{i+1} \le c_j$ for all $i < j \le r$, where $r$ is
the next non-entailed filter after $i$ if any, else the last filter.

A filter that is not entailing is "strongly-interchange-optimal" if it is interchange-optimal by (2) with respect to
all other filters. An entailing filter $i$ is strongly-interchange-optimal if $c_j / (1 - p(f_j \mid f_1 \wedge \cdots \wedge f_{j-1})) \le c_i / (1 - p(f_i))$
for all $r \le j < i$ where $f_j$ is not entailed by $f_i$ and where $r$ is the last non-entailed filter before $i$.

Lemma 2.1, Subsequence Deletion Suboptimality Lemma: Given a set of filters in which every filter pair is either
probabilistically independent or else one filter in the pair entails the other. Suppose filter $i$ in some sequence $S$
is strongly deletion optimal. Then strong deletion optimality of the filter originally at position $i$ must also hold
for any subsequence created by deleting items besides $i$ of $S$. Proof: If filters deleted from sequence $S$ occur
after $i+1$, they do not affect either side of (7), so it still holds. If filters deleted are independent of filter $i$, they
cannot affect either side of (7). If a filter $j$ before filter $i$ is entailed by $i$ ($j$ could not entail $i$ for the sequence
to make sense), and $j$ is deleted, the conditional probability can only decrease since $p(f_i \mid b) \le p(f_i \mid b \wedge c)$, so (7)
still holds. The only remaining case is when filter $i+1$ is deleted, and some filter $j$ to the right of it now follows
$i$ in the sequence. Then if $c_{i+1} \le c_j$ for any filter $j$ that could become the next filter after $i$ by deletions,
then (7) holds for the new filter. We only need to consider filters up to the next non-entailed filter, because that
one cannot be deleted. QED.

We can use this lemma to get sufficient conditions to say that a filter sequence is the globally optimal one with
respect to interchanges and deletions. The conditions require only polynomial time to confirm. This result
applies to an important class of problems, and filters can be purposely designed to make global optimality easier
to guarantee.

Theorem 2.1, Restricted Global-Optimality Theorem: Given a set of filters in which every filter pair is either
probabilistically independent or else one filter in the pair entails the other. Assume in some filter sequence $S$
that every filter is strong-interchange-optimal, and every entailed filter is strong-deletion-optimal. Then $S$ is the
global optimum in the space of improper subsequences created by deletions from it and permutations of it.
Proof: By Lemma 2.1, any subsequence $T$ created solely by deletions (with no interchanges) must also be
strong-deletion-optimal. Since each $T$ can be created by a single deletion from another strong-deletion-optimal
sequence $S$, no $T$ can be the global optimum because it is more costly than its $S$.

But sequence permutation could improve the cost. For this, we need only consider the moving left of an entailing
filter $E$ because (a) entailing filters are the only ones whose interchange optimality is affected by deletions,
and (b) their ratio $c/(1-p)$ will be decreased by deletions of leftward filters, so they may need to be moved left
to put the sequence back into sorted order on $c/(1-p)$. However, if strong-interchange-optimality holds,
interchange optimality cannot hold with the new left neighbor of $E$ no matter where $E$ is moved. Furthermore, if
strong-interchange-optimality holds with all filters left of $E$ through the first non-entailed filter, deletions of any
of them cannot affect the strong-interchange-optimality of the remaining filters because $p(f_i) \le p(f_i \mid f_j)$. QED.
Note two important cases to which the Theorem applies: if there are no entailed filters in a sequence, or if one
entailing filter entails all the others (as in "multi-level" signature matching [4]).

The conditions of the Theorem require only $O(m^2)$ operations to check, $m$ the number of filters, while at the
same time pruning a possibly exponential number of possible permuted subsequences. So a good heuristic for
finding an optimal sequence for a set of filters is to first interchange-sort the filters, then check if the conditions
of Theorem 2.1 hold; if they do, you have the global optimum and are done. Theorem 2.1 can also rule out from
consideration subsequences of the original filter sequence even when it does not apply to the original sequence.
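
The heuristic of interchange-sorting first can be tried on a small case. The sketch below sorts by the conditional ratio $c/(1-p)$ and then, instead of checking the conditions of the Theorem, simply compares the result with an exhaustive search over permuted subsequences; that exhaustive check is a stand-in chosen for brevity and is exponential in the number of filters. All names and numbers are hypothetical, and `entails` is assumed to be transitively closed.

```python
from itertools import combinations, permutations
from math import prod


def joint_prob(names, p, entails):
    """Joint pass probability: drop filters entailed by another member, multiply the rest."""
    names = set(names)
    entailed = set()
    for n in names:
        entailed |= entails.get(n, set()) & names
    return prod(p[n] for n in names - entailed)


def sequence_cost(seq, cost, p, entails):
    return sum(cost[f] * joint_prob(seq[:k], p, entails) for k, f in enumerate(seq))


def interchange_sort(filters, cost, p, entails):
    """Sort by c / (1 - p(f | entailed filters)), assuming entailed filters come first."""
    def ratio(f):
        p_cond = p[f] / joint_prob(entails.get(f, set()), p, entails)
        return cost[f] / (1.0 - p_cond)
    return sorted(filters, key=ratio)


def brute_force_optimum(filters, cost, p, entails):
    """Cheapest permuted subsequence; only entailed filters may be deleted."""
    deletable = set().union(*entails.values())
    required = [f for f in filters if f not in deletable]
    optional = [f for f in filters if f in deletable]
    best_cost, best_seq = None, None
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            for seq in permutations(required + list(extra)):
                c = sequence_cost(seq, cost, p, entails)
                if best_cost is None or c < best_cost:
                    best_cost, best_seq = c, list(seq)
    return best_cost, best_seq


cost = {"sig": 1.0, "other": 3.0, "full": 20.0}
p = {"sig": 0.2, "other": 0.5, "full": 0.05}
entails = {"full": {"sig"}}

candidate = interchange_sort(["full", "other", "sig"], cost, p, entails)
best_cost, best_seq = brute_force_optimum(["full", "other", "sig"], cost, p, entails)
```

For this small example the sorted sequence obeys the entailment order and coincides with the exhaustive optimum, so no filter is worth deleting or moving.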

2.5.  Inserting  entailed  filters 

Another way to improve the cost of a sequence is to insert a filter redundant with respect to an existing filter,
before that latter filter. (It makes no sense to insert a nonredundant filter, since adding such a filter would
change the final output of the filter sequence.) This is the exact opposite of the case in the last section, so we just
reverse the direction of the inequalities for local optimality, with the proposed insertion filter as the filter $i$. A
way to rule out insertion-optimality is to find another sequence consisting of the given sequence plus one extra
filter, where the extra filter is deletion-optimal. This is straightforward to accomplish if filter sequences are
considered in order of decreasing size.

2.6.  Distributed  filtering 

So far we have only considered sequential implementation of a conjunctive filter sequence. A distributed
implementation could assign each filter to a processor, send each filter processor each data item, and have all the
processors send the data that passes their tests to a single "intersection processor" that finds items that passed more
than a threshold number of filters. There are two problems with this approach: the intersection processor is a
bottleneck, and the degree of parallelism in the first phase is limited by the number of filters which can be
devised. More sophisticated information-retrieval systems today allow boolean expressions with embedded
conditions instead of keyword lists. A similar distributed approach can also be followed but with a more complicated
final aggregation step, and hence an even worse bottleneck.


Another approach is to assign sets of data items to processors, which is possible on a massively parallel
computer. [27] explored this idea on the Connection Machine, a massively parallel machine with 64,000 processors,
with each processor independently comparing keywords with its own associated item or items. The keywords in
sequence were supplied to all processors simultaneously, and each processor counted the number of matches for
each data item. At the conclusion of this, processors were polled to get the answers. This approach can get
answers very fast, and the speedup should be almost linear with the number of processors used. But this massive
parallelism also means massive idleness: Usually most data items do not match the query, and their processors
just sit uselessly. So this approach is very wasteful of computer resources, something which is hard to justify
for a non-critical application like information retrieval and a multi-million dollar machine like the Connection
Machine. Instead, we will explore partition of the data items with a significantly lower degree of parallelism,
an approach that could work for networked workstations.

2.7.  Data-partition  parallelism 

Suppose we have $N$ processors that can help execute the filters, where each processor filters a randomly chosen
disjoint partition of the input set. Each would first apply filter 1 to its partition, then filter 2 to the output of
filter 1, etc. Assume that the cost of applying a filter to a set of data items is proportional to the number of data
items; this is true for most filters, including signature table methods as well as the natural-language processing
filters discussed later in this paper. Then without overhead, each filter will execute in $1/N$ of the time. Otherwise,
we can assume that overhead cost is proportional (with constant $k_o$) to the number of processors used. So
the total cost of doing a sequence of filters with data-partition parallelism is $k_o N + (c_1/N) + p(f_1)(c_2/N) + \cdots$,
which has a minimum with respect to $N$ at:

$$N_{opt} = \sqrt{(1/k_o)\,(c_1 + p(f_1)\,c_2 + \cdots)} \tag{8}$$

This must be rounded to an integer, so this true optimum is usually only approached.

Usually this is a large number, larger than $N$, since $k_o$ represents the time to send simple messages to other
processors, which is usually much faster than executing filters themselves. Then since the derivative of the cost is
negative for $N < N_{opt}$, the minimum occurs at $N$, and it is best to use all processors. This is only different for
filters that are all very simple, with hardware that poorly supports message passing, or situations with a large
number of available processors (something difficult to justify for information retrieval). And with a sequence of
filters, the setup cost $k_o N$ need only be incurred once because each processor applies each filter to its data in
turn, so its cost can be amortized; and the final union of results is just a disjoint union, easy to accomplish.
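
The choice of processor count can be illustrated with a short sketch. It assumes the cost model just described (linear overhead plus sequential filter cost divided by the number of processors); the function names and the numeric values of the overhead constant and sequence cost are hypothetical.

```python
from math import ceil, floor, sqrt


def partition_cost(n, k_o, seq_cost):
    """Total cost with n processors: linear overhead plus evenly divided filter work."""
    return k_o * n + seq_cost / n


def best_processor_count(k_o, seq_cost, available):
    """Integer processor count minimizing the cost, limited to `available`.

    The cost is convex in n, so the best integer is one of the two integers
    around the real-valued optimum, clamped to the range 1..available.
    """
    n_real = sqrt(seq_cost / k_o)
    candidates = {
        min(available, max(1, floor(n_real))),
        min(available, max(1, ceil(n_real))),
    }
    return min(candidates, key=lambda n: partition_cost(n, k_o, seq_cost))


# Hypothetical values: sequential cost 4.0 per item batch, overhead 0.01 per processor.
few = best_processor_count(0.01, 4.0, available=8)    # optimum exceeds 8: use all 8
many = best_processor_count(0.01, 4.0, available=64)  # optimum (about 20) is reachable
```

With only 8 workstations the real-valued optimum lies beyond what is available, so all processors are used, which is the usual situation described above.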

But we need not assign all the processors to the same filter at the same time. We could assign some of the $N$
processors to one filter and the remainder to another filter. Surprisingly, we can prove this is never desirable,
under a few simple assumptions.

Theorem 2.2, Parallel-filter Theorem: Suppose we have $N$ processors to implement two information filters. Suppose
the cost, per unit number of data items, of $n$ processors doing a filter $i$ is $g(n) + (c_i/n)$, $g(n)$ a
communications-cost function (covering the processor setup and result-list intersection for the parallel processing),
and $c_i$ the cost of the filter per data item as above. That is, we decompose cost into a linear-speedup term
and an overhead-cost term. Assume $g''(n) \le 0$ and $g(0) = 0$ (concavity would reflect economies of scale in invoking
processors). Then it is best to apply all $N$ processors to one filter, then all $N$ processors to the other filter.
Proof: Suppose we do assign $n_1$ processors to filter 1 and $N - n_1$ processors to filter 2, where $0 < n_1 < N$. Then
execution time for just the two filters plus overhead will be $g(n_1) + g(N - n_1) + \max((c_1/n_1), (c_2/(N - n_1)))$. As a
function of real $n_1$, $(c_1/n_1)$ is monotonically decreasing and $(c_2/(N - n_1))$ is monotonically increasing. Hence the
minimum of the max term will be when $c_1/n_1 = c_2/(N - n_1)$, or when $n_1 = N c_1/(c_1 + c_2)$, at which value the cost
attains a minimum of $g(N c_1/(c_1 + c_2)) + g(N c_2/(c_1 + c_2)) + ((c_1 + c_2)/N)$. On the other hand, if all $N$ processors are
assigned to perform filtering operation 1, then filtering operation 2, the cost of the two filters plus overhead will
be $g(N) + (c_1/N) + (p_1 c_2/N)$ where $p_1$ is the probability of passing filter 1. If we ignore the overhead terms, the
parallel-filter (first) approach cannot be preferable since that would mean $(c_2/N) < (p_1 c_2/N)$, which is impossible
since $p_1 < 1$. Considering just the overhead terms, the parallel-filter approach's $g$ terms would only be preferable
if $g(n) + g(N - n) < g(N)$ for some integer $n$, $0 < n \le N$. But this is impossible because $g''(n) \le 0$, so
$g(N/2) \ge (g(0) + g(N))/2 = 0.5\,g(N)$ and $g(n) + g(N - n) \ge g(0) + g(N)$. Hence we have shown that both filter cost and
overhead cost are inferior for the parallel-filter approach.


The above analysis is actually conservative. Note that to implement parallel filtering, we must round from
$N c_1/(c_1 + c_2)$ to an integer; this can only worsen the cost further. Furthermore, the above results assume
that with sequential filtering, all N processors do filter 1, then all N do filter 2, etc. But if, say,
processor 6 finishes filter 1 early, it could start applying filter 2 to its results of filter 1, before
processor 5 finishes filter 1. This interleaving effect is data-dependent and hard to analyze, but could only
be better without parallel filtering, because then each processor does each filter and there are more
opportunities to finish early and do the next filter. QED.
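
The comparison in Theorem 2.2 can be checked numerically. The following is a minimal sketch, not part of the original analysis: the overhead function `g` (a scaled square root, chosen only because it is concave with `g(0) = 0`) and the values of `N`, `c1`, `c2`, and `p1` are all hypothetical.

```python
import math

def g(n):
    """Hypothetical concave overhead function with g(0) = 0 (an assumption)."""
    return 0.05 * math.sqrt(n)

def split_cost(n1, N, c1, c2):
    """Filter 1 on n1 processors concurrently with filter 2 on N - n1 processors."""
    return g(n1) + g(N - n1) + max(c1 / n1, c2 / (N - n1))

def sequential_cost(N, c1, c2, p1):
    """All N processors do filter 1, then filter 2 on the items that passed."""
    return g(N) + c1 / N + p1 * c2 / N

# Hypothetical parameters, chosen only for illustration.
N, c1, c2, p1 = 8, 3.0, 5.0, 0.4

# Best integer split of the processors between the two filters.
best_split = min(split_cost(n1, N, c1, c2) for n1 in range(1, N))
seq = sequential_cost(N, c1, c2, p1)
print(f"best split: {best_split:.4f}, all-N sequential: {seq:.4f}")
```

With these made-up numbers the sequential plan costs roughly 0.77 per data item against roughly 1.20 for the best split, in line with the theorem.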

The  following  corollary  generalizes  this  result  on  two  filters  to  arbitrary  execution  plans  involving  parallelism 
for  a  set  of  filters. 

Corollary 2.1: Suppose we are given an execution plan for a set of filters on N processors, expressed as a
directed acyclic graph with each node marked as to which filter and which of N subdivisions of the data it
applies to, and where a filter can appear more than once. Suppose the cost, per unit number of data items, of
n processors doing a filter i is $g(n) + (c_i/n)$, with $g(n)$ a communications-cost function (covering the
processor setup and result-list intersection for the parallel processing), and $c_i$ the cost of the filter
per data item as above; and assume $g''(n) \le 0$ and $g(0) = 0$. Then if that execution plan has different
filters in parallel anywhere, it is not optimal.
Proof: Such an execution plan could be transformed into a sequence of filters by repeatedly taking a pair of
parallel strands and sequencing them arbitrarily. Then if we reverse the order of these transformations, we
will get the original execution plan. But in this latter process, Theorem 2.2 applies at every step, so the
original execution plan cannot be optimal even if the completely sequenced plan is optimal. Furthermore, if a
filter appeared more than once in the original execution plan, it will appear more than once in the completely
sequenced plan, so that plan cannot be optimal anyway. QED.

As an example of the Corollary, suppose we have three filters $f_1$, $f_2$, and $f_3$. Then putting $f_1$ in
parallel with the sequence of $f_2$ followed by $f_3$ cannot be optimal because the last two can be considered
a composite filter. Similarly, suppose we do $f_1$ then $f_3$ on one parallel track, $f_2$ then $f_3$ on
another, where the starting time of $f_3$ on the first track could be different than the starting time on the
second; then sequencing would have each filter once plus $f_3$ a second time, which is worse than just having
each filter once.
2.4. Parallel non-filtering processes

Information-filtering applications can require additional non-filtering processing. In natural-language
information retrieval, for instance, parsing and interpretation of the natural language is necessary before
detailed matching can be done. The effect of such processes can be modeled as imposing earliest start times
for all the processes that depend on them. This introduces additional inequality constraints of a more
traditional sort into our scheduling problem.

We can handle these constraints in the standard manner of optimization theory. If the local optimality
conditions can be satisfied without violating the new start-time constraints, the local optimum remains a
local optimum. Otherwise, the local optima must be on the border of the region of feasibility with the minimum
number of "active" constraints (inequalities reducing to equalities). That means that any local optimum of the
new problem must satisfy interchange optimality and local deletion optimality except in the minimum number of
places necessary to satisfy the start-time constraint. Standard algorithms can solve such problems.

2.5. General algorithm for sequence optimization

We can now provide a general method for filter-sequence optimization. First sort the filters for interchange
optimality. If Theorem 2.1 applies to this sequence, and it obeys dependencies, it is the global optimum.
Otherwise, rearrange to satisfy the dependencies, and conduct a branch-and-bound search: Consider possible
subsequences of the stated filter sequence created by deleting entailed filters that are not
strongly-deletion-optimal and resorting to preserve interchange optimality. Any such subsequence that satisfies
Theorem 2.1 and dependencies represents a local optimum for the original sequence of filters, and furthermore
none of its subsequences can be a local optimum. Then if N processors are available, assign all N for each
filter in turn if N is less than the cost-minimizing number of processors found above; otherwise use that
cost-minimizing number. This algorithm has exponential time complexity in the number of filters because of the
possibly exponential number of combinations that must be considered, but it is simple and works.
3. Experiments with random data

Analysis of filter execution plans can vary greatly in difficulty depending on the parameters of the filters
involved. To better judge the number of local optima and how often the global optimum is easy to find, we
conducted experiments with randomly generated filters. Given a particular number of filters to create, we
randomly designated certain ones as entailing filters, and randomly chose some filters for them to entail.
Entailment relationships were restricted to form a forest, since signature-table filters do, and it is
difficult and not very useful for filter $f_a$ to be entailed by both filters $f_b$ and $f_c$ when $f_b$ and
$f_c$ are unrelated. Note that deletion of a node from a forest and rerouting the children nodes maintains the
forest property. (The forest restriction does allow a filter $f_a$ to entail both $f_b$ and $f_c$; for
instance, $f_a$ could be a full match to the data item, $f_b$ a match to its high-order bits, and $f_c$ a
match to its low-order bits.)

Costs were randomly assigned to each filter from the uniform distribution 0 to 10. Filter success probabilities
were assigned from the uniform distribution 0.01 to 0.99; for nonentailing filters, this was taken as the a
priori probability, and for entailing filters, this was taken as the conditional probability given that all
previous filters succeeded.
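
The generation procedure just described could be sketched as follows. This is only an illustrative reading of the text, not the code used in the experiments: the function name `random_filter_set`, its parameters `p_select` and `p_entailing`, and the `parent` representation of the entailment forest are assumptions introduced here.

```python
import random

def random_filter_set(n, p_select, p_entailing, rng):
    """Return (costs, probs, parent) for n random filters.

    parent[j] is the single filter that directly entails filter j, or None.
    Allowing at most one entailer per filter, and no cycles, keeps the
    entailment relationships a forest.
    """
    costs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    probs = [rng.uniform(0.01, 0.99) for _ in range(n)]
    parent = [None] * n

    def reaches(a, b):
        """True if a is b, or a entails b through a chain of entailers."""
        while b is not None:
            if b == a:
                return True
            b = parent[b]
        return False

    for i in range(n):
        if rng.random() >= p_entailing:
            continue  # filter i is not designated as an entailing filter
        for j in range(n):
            if j == i or rng.random() >= p_select:
                continue  # j was not selected for possible entailment by i
            # Rule out the entailment if it would violate the forest property.
            if parent[j] is None and not reaches(j, i):
                parent[j] = i
    return costs, probs, parent

# Example: 15 filters with hypothetical selection probabilities.
costs, probs, parent = random_filter_set(15, 0.3, 0.5, random.Random(0))
```

Passing an explicit `random.Random` instance keeps a run reproducible, which is convenient when many filter sets are generated per parameter setting.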

Fig. 1 shows an example run with four randomly generated filters. The first argument to "filter" is the filter
number, the second its average cost, and the third its probability information. "Dep" means the filter entails
those listed, so filter 2 entails 1, and filter 3 entails both 1 and 2. There are four sequences of filter
subsets that satisfy interchange optimality and the dependencies. Of these, two satisfy deletion optimality
criteria (more often, only one does). The first is the true optimum, and the heuristic greedy algorithm finds
it. Note how deletion interacts with interchange optimality: Filter 4 normally should precede filter 3, but
when both filters 1 and 2 are deleted, filter 3 becomes more valuable and must precede 4.

Fig. 2 tabulates experimental results which are graphically represented in Figs. 3-11. Each row of Fig. 2
represents 1000 randomly generated filter sets. The first column is the number of filters, the second the
probability that a filter would be selected for possible entailment (although it would be ruled out if such an
entailment would violate the forest property), and the third the probability that a filter is entailing (that
is, whether we should attempt to construct entailment relationships for it). The remaining columns show
experimental results as means of natural logarithms, with associated standard errors in parentheses. We use
the logarithms because they are more appropriate for summarizing combinatorial experiments. The fourth column
is the mean of the logarithms of the number of possible subsequences that need to be considered, after
interchange sorting, for each set of randomly generated filters. The fifth column is the mean of the logarithms
of the number of those subsequences that were judged locally optimal with respect to the criteria of section 2,
a number generally considerably smaller than that of the previous column. Note the values in the fifth column
increase more slowly than the values in the fourth column.

Fig. 3 plots the size of the search space, the total number of sequences n! for n filters, with the values
displayed in the fourth column of Fig. 2; four assignments of dependency parameters are shown. Figs. 4-7 show
these last four curves plotted against the number of locally optimal sequences, the fifth column of Fig. 2.

3.1. A greedy algorithm

The sixth column of Fig. 2 shows the performance of a simple heuristic "greedy" algorithm to find the optimum,
in terms of the means of the logarithms of the ratios of the cost of the sequence found by the algorithm to the
cost of the true optimum sequence. This algorithm starts with the interchange-sorted (see discussion below)
list of filters, and successively deletes the best filter that it can (that is, the entailed filter whose
deletion improves overall cost the most after resorting), until no further deletion can improve cost. No
backtracking is done. This heuristic algorithm is $O(m^3)$, m the number of filters, since interchange-sorting
is $O(m \log m)$, and there are $O(m)$ things to delete and hence $O(m)$ steps; each step looks at $O(m)$
subsequences and evaluates the cost of each subsequence in $O(m)$ time, then resorts in $O(m)$ time to
reposition one entailing filter. The heuristic algorithm cannot get the optimal solution in every case of the
general problem. But the sixth column demonstrates that it nearly always gets the correct answer for up to
fifteen filters, and its rate of deterioration is considerably slower than the increases in the size of the
problem space, the number of sequences considered, and the number of local optima. Figs. 8-11 plot columns 5
and 6 of Fig. 2 against one another.
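
The greedy procedure can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used for Fig. 2: it assumes that when an entailed filter is deleted its pass probability multiplies into that of its entailer, that the sort by $c/(1-p)$ leaves entailed filters ahead of their entailers, and it re-sorts from scratch at each step for brevity. The names `effective_prob`, `interchange_sort`, `expected_cost`, and `greedy_delete`, and the three example filters, are hypothetical.

```python
def effective_prob(name, filters, present):
    """Pass probability of a filter given which of its entailed filters remain."""
    p = filters[name]["p"]
    for child in filters[name]["entails"]:
        if child not in present:
            # Assumption: a deleted entailed filter no longer screens items
            # first, so its pass probability folds into the entailer's.
            p *= effective_prob(child, filters, present)
    return p

def interchange_sort(names, filters):
    """Sort by the ratio c / (1 - p), using effective probabilities."""
    present = set(names)
    return sorted(names, key=lambda f: filters[f]["c"]
                  / (1.0 - effective_prob(f, filters, present)))

def expected_cost(order, filters):
    """Expected per-item cost of applying the filters conjunctively in order."""
    present, cost, reach = set(order), 0.0, 1.0
    for name in order:
        cost += reach * filters[name]["c"]
        reach *= effective_prob(name, filters, present)
    return cost

def greedy_delete(filters):
    """Repeatedly delete the entailed filter whose removal lowers cost most."""
    entailed = {ch for f in filters.values() for ch in f["entails"]}
    order = interchange_sort(list(filters), filters)
    best = expected_cost(order, filters)
    while True:
        trials = [interchange_sort([x for x in order if x != name], filters)
                  for name in order if name in entailed]
        if not trials:
            return order, best
        trial = min(trials, key=lambda t: expected_cost(t, filters))
        if expected_cost(trial, filters) >= best:
            return order, best  # no deletion improves cost; no backtracking
        order, best = trial, expected_cost(trial, filters)

# Hypothetical filters: "fine" entails "coarse"; "reg" is independent.
example = {
    "reg":    {"c": 0.1, "p": 0.2, "entails": []},
    "coarse": {"c": 1.0, "p": 0.9, "entails": []},
    "fine":   {"c": 8.0, "p": 0.5, "entails": ["coarse"]},
}
plan, plan_cost = greedy_delete(example)
```

For these made-up values the sorted sequence reg, coarse, fine costs about 1.74 per item, and deleting the weakly selective coarse filter lowers this to about 1.70, so the sketch returns the plan reg, fine.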

The greedy algorithm is supported by the following useful theorem that relates interchange optimality and
deletion optimality. This says that we need only interchange-sort the filter set in order to check for a local
optimum: If the interchange sorting does not obey entailment relationships, it is not locally optimal.

Theorem 3.1: If a conjunctive filter sequence is sorted so as to satisfy interchange optimality, using for each
entailing filter its conditional probability given all entailed filters are present, and this sequence violates
entailment relationships, the sequence is not deletion-optimal. Proof: Suppose a sequence is interchange-sorted
so that entailing filter $f_e$ occurs before its entailed filter $f_d$, where d is the costliest in $c/(1-p)$
of all such entailed filters after e. Then $c_e/(1 - p(f_e|u_e)) < c_d/(1 - p(f_d|u_d))$. Now consider moving
d before e in order to satisfy entailment. Since d is the ratio-costliest of filters entailed by e and after e,
it is also the ratio-costliest of all filters entailed by e, and must go just before e to achieve the
locally-optimal ordering of all sequences with d before e. But with d just before e, the condition for
deletability of d is that $c_d/(1 - p(f_d|u_d)) > c_e$. But this follows from the original assumptions since
$c_e < c_e/(1 - p(f_e|u_e))$. QED.

4. Disjunctions and negations

We now extend the previous analysis to filter execution plans that are equivalent to arbitrary boolean
expressions.


4.1.  Ordering  of  disjunctions 

We expect that disjunctions of filters will be rather rare, as conjunctions of conditions are more useful in
information retrieval. But in general, optimization of disjunctions is precisely analogous to (the "dual" of)
optimization of conjunctions. The following theorem generalizes Theorem 2 of [14] beyond independent filters.

Theorem 4.1: The problem of optimally ordering a disjunctive sequence of filters is equivalent to optimally
ordering a conjunctive sequence in which the costs are the same and the probabilities are mapped to their
inverses (complements).
Proof: If the filters are applied sequentially, then everything that fails the first filter is applied to the
second filter, and everything that fails both filters is applied to the third filter, and so on. The final
answer is just the appending of results from all filters, since this union is disjoint. Hence the cost formula
is $c_1 + c_2\,p(\neg f_1) + c_3\,p(\neg f_1 \wedge \neg f_2) + \ldots$, identical to that for conjunctive
sequences except with the inverse probabilities. But the inverse function $1 - p$ is monotonic and has the same
domain and range. So the problem of finding an optimal disjunctive sequence has an exact "dual" problem of
finding the optimal conjunctive sequence for the inverse of the original probabilities. The solution to the
latter found by the abovementioned methods then maps to the solution for the former. QED.

Note  that  this  also  provides  a  criterion  for  deletion  of  redundant  disjunctive  filters,  and  implies  that  concurrent 
execution  of  different  filters  in  disjunctions  is  also  not  advantageous. 
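
The duality of Theorem 4.1 can be illustrated for the special case of independent filters. The sketch below is an illustration only: it assumes independence (which the theorem itself does not need), represents each filter as a hypothetical `(cost, pass_probability)` pair, and checks the dual ordering against brute-force enumeration.

```python
from itertools import permutations

def conj_cost(order):
    """Conjunction: each filter sees only items that passed all earlier filters."""
    cost, reach = 0.0, 1.0
    for c, p in order:
        cost += reach * c
        reach *= p
    return cost

def disj_cost(order):
    """Disjunction: each filter sees only items that failed all earlier filters."""
    cost, reach = 0.0, 1.0
    for c, p in order:
        cost += reach * c
        reach *= 1.0 - p
    return cost

def best_disjunctive_order(filters):
    """Order a disjunction by solving the dual conjunctive problem."""
    def dual_ratio(f):
        c, p = f
        q = 1.0 - p           # complemented probability used by the dual
        return c / (1.0 - q)  # conjunctive interchange ratio c / (1 - q)
    return sorted(filters, key=dual_ratio)

# Hypothetical independent filters as (cost, pass probability).
filters = [(4.0, 0.3), (1.0, 0.1), (2.0, 0.8)]
ordered = best_disjunctive_order(filters)
brute = min(disj_cost(perm) for perm in permutations(filters))
```

Here `disj_cost(ordered)` coincides with the brute-force minimum, and the disjunctive cost of any order equals the conjunctive cost of the same order with every probability complemented.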

4.2. The distributive laws

If  a  filter  expression  includes  both  conjunctions  and  disjunctions,  should  we  factor  it  (using  the  distributive  laws) 
to improve execution time? Surprisingly, the answer is an unequivocal "yes."

Theorem 4.2: With three arbitrary nontrivial filters, the execution plan $f_1 \wedge (f_2 \vee f_3)$ is
preferable to the equivalent plan $(f_1 \wedge f_2) \vee (f_1 \wedge f_3)$. Proof: Let u represent all events
before $f_1$ is evaluated. The first execution plan is better than the second if:

$c_1 + c_2\,p(f_1|u) + c_3\,p(f_1 \wedge \neg f_2|u) < c_1 + c_2\,p(f_1|u) + c_1\,p(\neg f_1 \vee \neg f_2|u) + c_3\,p(f_1 \wedge \neg f_2|u)$

which simplifies to $0 < c_1\,p(\neg f_1 \vee \neg f_2|u)$, which is always true for nontrivial filters. QED.

Hence an additional local optimality condition for an execution plan involving both conjunctions and
disjunctions is that all possible factorings be made for conjunctions over disjunctions. A similar result holds
for disjunctions over conjunctions.

Theorem 4.3: The execution plan $f_1 \vee (f_2 \wedge f_3)$ is always preferable to the equivalent plan
$(f_1 \vee f_2) \wedge (f_1 \vee f_3)$.
Proof: Similarly to the preceding, we get:

$c_1 + c_2\,p(\neg f_1|u) + c_3\,p(\neg f_1 \wedge f_2|u) < [c_1 + c_2\,p(\neg f_1|u)] + [c_1\,p(f_1 \vee f_2|u) + c_3\,p(\neg f_1 \wedge f_2|u)]$

And all terms cancel except for the third on the right side, $c_1\,p(f_1 \vee f_2|u)$. QED.
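
The cost difference in the two distributive-law theorems can be made concrete with a small numeric sketch. It assumes independent filters purely so that the joint probabilities can be computed from `p1` and `p2`; the theorems themselves do not need independence, and the function names and parameter values are hypothetical.

```python
def conj_factored(c1, c2, c3, p1, p2):
    """f1 AND (f2 OR f3): f3 sees only items that passed f1 but failed f2."""
    return c1 + c2 * p1 + c3 * p1 * (1 - p2)

def conj_expanded(c1, c2, c3, p1, p2):
    """(f1 AND f2) OR (f1 AND f3): f1 is re-run on items failing f1 AND f2."""
    return c1 + c2 * p1 + c1 * (1 - p1 * p2) + c3 * p1 * (1 - p2)

def disj_factored(c1, c2, c3, p1, p2):
    """f1 OR (f2 AND f3): f3 sees only items that failed f1 but passed f2."""
    return c1 + c2 * (1 - p1) + c3 * (1 - p1) * p2

def disj_expanded(c1, c2, c3, p1, p2):
    """(f1 OR f2) AND (f1 OR f3): f1 is re-run on items passing f1 OR f2."""
    p1_or_p2 = 1 - (1 - p1) * (1 - p2)
    return c1 + c2 * (1 - p1) + c1 * p1_or_p2 + c3 * (1 - p1) * p2

# Hypothetical costs and pass probabilities.
args = (2.0, 3.0, 5.0, 0.6, 0.3)
extra_conj = conj_expanded(*args) - conj_factored(*args)  # c1 * (1 - p1*p2)
extra_disj = disj_expanded(*args) - disj_factored(*args)  # c1 * p(f1 or f2)
```

In both cases the unfactored plan pays exactly one additional evaluation of $f_1$ on a nonempty fraction of the items, which is the surviving term in each proof.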

Note that Theorems 4.2 and 4.3 do not require probabilistic independence. And the $f_i$ terms can be composite
(or logical combinations of other filters), so the result applies to the distribution of conjunction over more
than two disjunctions, and vice versa. But the above results do only provide local optimality conditions,
because some boolean expressions can be factored in more than one way. For instance, there are two local optima

$(a \wedge b) \vee (a \wedge c) \vee (b \wedge c) = (a \wedge (b \vee c)) \vee (b \wedge c) = (a \wedge b) \vee ((a \vee b) \wedge c)$

However, this possibility can usually be safely ignored, since conjunctions are the most natural way to conjoin
nearly all filters; and when disjunctions do occur, it is unlikely that a filter must appear twice in any
locally-optimal execution plan, as is necessary above.

4.3. Redundancy elimination in boolean expressions

A special case of the distributive law is:

$f_1 \vee (f_1 \wedge f_2) = (f_1 \wedge true) \vee (f_1 \wedge f_2) = f_1 \wedge (true \vee f_2) = f_1$

The final expression must execute faster than the original expression because it is a subexpression. The other
"absorption" law is $f_1 \wedge (f_1 \vee f_2) = f_1$.

In general, all such redundancy-elimination laws of logic can be fruitfully applied to boolean expressions of
filter combinations. They include:

$f \vee f = f$, $f \wedge f = f$, $f \wedge true = f$, $f \wedge false = false$, $f \vee true = true$, $f \vee false = f$

All these involve substitution of an expression requiring less work to evaluate.

4.4.  Negations 

Negation operators will complete a boolean algebra of filter expressions. Since filters always operate on a
finite set, it is helpful to think of the negations as being set differences on either the results of filters
previously passed, if any, or otherwise the full database. And we will automatically eliminate double negations
from a boolean expression.

We will assume that the negation of a noncomposite filter has the same evaluation cost as the unnegated filter.
This is true for signature tables and other filters wherein a similar calculation is performed upon every input
data item, and the result used to decide if the item passes the test; more complex filters that do not fulfill
this restriction can often be decomposed into boolean combinations of subfilters that do. Under this
assumption, we can prove that negations in a boolean filter expression should be pushed as far as possible
inside expressions, so that they all apply to single filters and thus can be evaluated at no cost penalty.

Theorem 4.4: Consider an equivalence class of boolean expressions such that any member of the class can be
transformed into any other member by some sequence of applications of DeMorgan's Laws. Then if this expression
is interpreted as referring to information filters, the globally optimal member of the class is that in which
every negation is of a noncomposite (simple) filter. Proof: There must be only one such expression, because the
laws above apply independently to each negation sign. Any other expression in the equivalence class must be
derivable by a series of applications of the "negation-factoring" forms of DeMorgan's Laws.

First consider $f_1 \wedge f_2$ versus $\neg(\neg f_1 \vee \neg f_2)$. If u represents the previous context of
the subexpressions, and $f_1$ and $f_2$ are simple filters, the first expression costs $c_1 + p(f_1|u)\,c_2$,
and the second costs $c_1 + p(f_1|u)\,c_2 + c_n$, where $c_n$ is the "negation cost", the cost per data item of
checking which items in a set do not belong to u. Hence with simple but nontrivial filters, the first form is
always better because all the terms are the same except for the added negation-cost term in the second
expression. Second consider $f_1 \vee f_2$ versus $\neg(\neg f_1 \wedge \neg f_2)$. The first expression costs
$c_1 + p(\neg f_1|u)\,c_2$ and the second costs $c_1 + p(\neg f_1|u)\,c_2 + c_n$. Again, the second cost is
worse for simple nontrivial filters.

If $f_1$ and $f_2$ are composite in the above filter expressions, however, the transformation from the first
form to the second could decrease cost if $f_1$ and $f_2$ are themselves negated expressions, in which case
double negations cancel. However, the cost calculation for the second form of each pair always adds at least
one $c_n$ term. Hence its cost is always worse than that of an expression in the same equivalence class with no
$c_n$ terms in its cost, the expression with negations pushed all the way inward. QED.
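
Pushing negations inward, as Theorem 4.4 recommends, is a short recursive rewrite. The sketch below is illustrative only: the nested-tuple encoding of expressions (`("and", ...)`, `("or", ...)`, `("not", x)`, with simple filters as strings) and the name `push_negations` are assumptions made for this example.

```python
def push_negations(expr, negate=False):
    """Rewrite expr so that every negation applies to a simple filter.

    expr is a filter name (str) or a tuple ("not", e), ("and", e1, e2, ...),
    or ("or", e1, e2, ...). `negate` records a pending outer negation.
    """
    if isinstance(expr, str):
        return ("not", expr) if negate else expr
    op = expr[0]
    if op == "not":
        # A negation flips the pending flag; double negations cancel.
        return push_negations(expr[1], not negate)
    # DeMorgan's Laws: a pending negation swaps AND and OR and moves inward.
    dual = {"and": "or", "or": "and"}
    new_op = dual[op] if negate else op
    return (new_op,) + tuple(push_negations(arg, negate) for arg in expr[1:])

# not (f1 or not f2)  becomes  (not f1) and f2
result = push_negations(("not", ("or", "f1", ("not", "f2"))))
```

Each simple filter in the result is evaluated either directly or negated, so under the equal-cost assumption above no separate negation pass over a composite result is needed.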

Besides DeMorgan's, a few other laws of logic can help simplify an expression. Certainly
$f_1 \wedge \neg f_1 = false$ and $f_1 \vee \neg f_1 = true$ are desirable substitutions. But the "negative
absorption" laws $f_1 \wedge (\neg f_1 \vee f_2) = f_1 \wedge f_2$ and
$f_1 \vee (\neg f_1 \wedge f_2) = f_1 \vee f_2$ are unnecessary since they just represent the way we execute
filters.
4.5. Simulated annealing for optima

Fig. 12 summarizes the main results of this paper so far. We have covered enough of the laws of boolean logic,
the propositional calculus, to generate all equivalent expressions to an arbitrary input expression, since we
have covered all those listed in [15]. The middle column shows that the first four classes of logical
equivalences do require some cost and probability analysis; but the methods of section 3 will do this, and they
are not hard. The rightmost column shows that three of the seven classes of equivalences are possible problems
for a "greedy" algorithm which sequentially applies the best equivalence until it reaches a local optimum.
These three are what make the general problem difficult and probably exponential in complexity. However,
results of section 3 suggest that a polynomial-time greedy algorithm can get the right answer most of the time.
If this is insufficient assurance of optimality, simulated annealing can explore the search space randomly
until some given level of assurance is achieved. The necessary random transformations of the boolean expression
can include all the methods in Fig. 12.

5. Experiments with a natural-language processing-filter application

We now discuss a specific filtering application to which we have applied our theory. This application
illustrates a number of subtleties in the use of filters. It also is valuable in its own right, as one of the
easiest ways for users to access multimedia databases. The idea is to provide information retrieval of
multimedia data with natural-language (in our case, English) questions as input. Examples of this approach are
[13] and [22] and the more complicated ideas reviewed in [23]. Natural-language processing poses a good
challenge for filtering ideas because it can require much time, yet it is not so slow as to fail to benefit
from a modest improvement in efficiency.

We wish to improve our earlier MARIE system [21] with the ideas of this paper. The goal of MARIE is to take as
input an English noun phrase representing a query, and return as output the multimedia data items that match
the meaning (as opposed to literal words) of the query, doing most processing with the pointers to those data
items and not the items themselves. The domain of MARIE is captioned photographs in the Photo Lab of NAWC-WD,
China Lake, California. The conceptual units of the improved MARIE will be:
1. Coarse-grain matcher (C): a keyword match of nouns in the English query input to caption nouns,
using index files [21]. It returns a set of pointers to media data items. This process must use a type
hierarchy to be truly helpful, as observed in [6] and [26].

2. Parser (P): a natural-language understanding system that parses English query input and creates a
meaning list, its logical form [21]. We assume input and captions exhibit "conjunctive semantics" [1]
where the meaning of the whole is the conjunction of a set of logical expressions that define the meaning
of the parts, a usually reasonable assumption.

3. Registration-data tester (R): a formatted-condition processor, like those in database query languages,
that returns pointers to data items matching registration-data (formatted non-caption) conditions in the
query input. Registration data at NAWC-WD includes date, location, photographer, type of film, and
security classification. This tester was implemented for this paper using a main-memory database.

4. Picture-type matcher (T): a process that identifies the possible broad classes to which a media datum or
a query can belong (like "test" or "historical" or "public relations" for photographs), and rules out media
data whose classes are incompatible with the query classes [20].

5. Fine-grain matcher (F): a graph matcher that checks whether the query input graph (representing the
query meaning list) is isomorphic to some part of some caption graph (representing the caption's meaning
list) [21]. Like the coarse-grain matcher, this needs a type hierarchy, and it also needs a part-whole
hierarchy. It helps to separate this from the coarse-grain match, as did [7, 18], since it requires
combinatorial analysis and can be much slower.

Fig. 13 shows the dependencies between the processes. Each of these can be a separate process on a separate
processor; input and output queues can enable asynchronous communication. Items 1, 3, 4, and 5 are conjunctive
information filters as we defined them in section 2; and item 2 imposes a minimum start-time constraint on
4 and 5. Two additional filters that could also be added are an automatic "content-analysis" filter that
analyzes the media data for things and relationships displayed within, and the human user who accepts or
rejects what the computer eventually supplies. We could also add separate filters for additional textual,
audio, and video aspects of a datum, if these could be analyzed separately.

5.1. Mathematical analysis of the four MARIE filters

We can exhaustively consider all possible orderings. Entailments can be summarized:

- C-R: independent, following argument above
- C-T: approximately independent (two different kinds of reasoning)
- C-F: first entailed by second
- R-T: independent (they examine different sorts of data)
- R-F: independent
- T-F: first entailed by second

Then the only possible filter sequences obeying dependencies are:

4-filter: CRTF, CTRF, RCTF, RTCF, TCRF, TRCF, CTFR, TCFR
3-filter: RTF, CRF, TRF, RCF, TFR, CFR
2-filter: RF, FR
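
The enumeration of sequences that obey the dependencies can be reproduced mechanically. The following is a small sketch, assuming only what the entailment summary states (C and T are each entailed by F, so each must precede F whenever present, and only entailed filters are candidates for deletion); the function name `valid_sequences` is hypothetical.

```python
from itertools import permutations

ENTAILED_BY_F = ("C", "T")  # filters entailed by the fine-grain matcher F

def valid_sequences(names):
    """All orderings of `names` in which every filter entailed by F precedes F."""
    result = []
    for perm in permutations(names):
        seq = "".join(perm)
        if all(seq.index(x) < seq.index("F") for x in ENTAILED_BY_F if x in seq):
            result.append(seq)
    return result

four = valid_sequences("CRTF")                            # nothing deleted
three = valid_sequences("RTF") + valid_sequences("RCF")   # C or T deleted
two = valid_sequences("RF")                               # both C and T deleted
print(len(four), len(three), len(two))
```

R and F are never deleted here because neither is entailed by another filter, which is why every listed sequence contains both.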


We randomly chose 230 captions from the NAWC-WD Photo Lab database. We used 44 test queries, 42 supplied by
the Photo Lab personnel as typical of the queries they receive every day, and two longer queries from [20]. We
obtained average CPU times per data item (in seconds) and average success probabilities in experiments with an
implementation in Quintus Prolog on a Sun SparcStation. These parameters, plus the ratios
$r_i = c_i/(1 - p_i)$, are:

$c_C = 0.0102$, $p_C = 0.0305$, $r_C = 0.0105$
$c_R = 0.000602$, $p_R = 0.958$, $r_R = 0.0144$
$c_T = 0.000236$, $p_T = 0.749$, $r_T = 0.000939$
$c_F = 3.11$, $p_F = 0.421$ (conditional on C and T), $r_F = 5.37$

Note how a redundant filter can still be highly useful: Coarse-grain rules out an average of 97% of the
database in very little time.


If we could ignore the parser (as when certain queries are common, and their meaning lists stored in advance),
the interchange sort of these four filters would be TCRF (picture-type matcher, coarse-grain matcher,
registration-data tester, and fine-grain matcher). This order satisfies the dependencies. It would also satisfy
deletion optimality since 0.000939 < 0.0105 for deletion of T and 0.0105 < 0.000602 + (0.958 * 3.11) for
deletion of C. C would be strong-deletion-optimal since it is not followed by an entailed (deletable) filter;
T would be strong-deletion-optimal because 0.000236 < 0.0102. F would satisfy strong interchange-optimality
because 0.000602 < 5.37. Hence TCRF would be globally optimal.

But the parser imposes a minimum time for T and F from the start of processing. The parser is going to be on
the critical path for the optimal execution plan, because even if C and R are done sequentially while the
parser is executing, 230 * (0.0102 + (0.0305 * 0.000602)) = 2.39 is the expected time for the sequence C-R, and
we obtained an average of $c_P = 3.76$ seconds for parse CPU time for the 44 queries. Hence the parser and F
will be on the critical path (F cannot be deleted), and must occur in that order; if T occurs, it must be
between the parser and F because of the dependencies. T must occur because it is strongly-deletion-optimal,
since 0.000236 < 3.11. Hence the critical path is parser-T-F.

As for parallelism, the parser cannot be decomposed into parallel tasks as currently implemented. And parallelism
is of no advantage to C and R because they are not on the critical path, even if they are taken sequentially.
So execute C-R sequentially on one processor, to run in parallel with the parser. As for T and F, we must estimate
the overhead of parallelism. We conducted experiments with the Quintus Prolog communications package TCP
using 1, 2, 3, and 4 processors for the fine-grain matcher. In these experiments, we observed no statistical effect
of the number of processors on overhead cost. So we fit cost to the formula k_o + (c_F/N) for the cost of distributing
fine-grain matching over N processors, and estimated k_o = 0.4 as the amortized overhead per data item. Hence T and F
should each be allocated the maximum number of available processors, and all of T's executions must precede
all of F's executions, following Theorem 2.2. Fig. 14 summarizes the optimal execution plan.
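
The fitted cost model can be tabulated to see why every available processor should be used. This is a minimal Python sketch, not the authors' code; the function name is hypothetical, and the overhead constant is an assumed reading (0.4) of the fitted value in the text.

```python
def per_item_cost(c_f, n_processors, overhead):
    """Amortized cost per data item: overhead + c_F / N."""
    return overhead + c_f / n_processors


C_F = 3.11  # fine-grain matcher cost per data item, from the text
K_O = 0.4   # amortized overhead per data item (assumed reading of the fitted value)

costs = {n: per_item_cost(C_F, n, K_O) for n in (1, 2, 3, 4)}
for n, cost in costs.items():
    print(n, round(cost, 3))
```

Because the overhead term does not grow with N in this model, the cost falls with every processor added, which is why the maximum number is allocated.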

To confirm the reasonability of the cost estimates, we conducted further tests. Among other things, we measured
real time to the nearest second for execution of the four filter sequences CF, CTF, F, and TF on a Sun
SparcStation. We measured real time to be sure to include all the factors that affect execution time, but the
workstation used a fileserver that serves other users, so our measurements are not precise. In 42 queries of the
previously supplied 44 (two of which showed new bugs and were excluded), the observed ratio of real execution
time for F to TF was 1.18 with a standard deviation of 0.43, versus a theoretical ratio of 1.33; and the ratio of F
to CF was 22.1 with a standard deviation of 17.3, versus a theoretical ratio of 29.7. The observed ratio of real
execution time for CF to CTF was 2.20 with a standard deviation of 1.50, versus a theoretical ratio by our above
methods of 1.33; the data was in the form of small integers averaging about 20, so this measure was more crude.
In these same experiments, F was never faster than CF and TF was never faster than CTF, as predicted by
theory; and F was faster than TF as predicted in only 9 out of 42 cases, and only slightly faster in each of the 9.
These results are confirmation of our theory considering that our method of parameter estimation, by averaging
over all queries, biases performance toward the few queries with many answers.
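
The theoretical ratios quoted above can be approximated from the measured parameters. The sketch below is an illustration under stated assumptions rather than the authors' procedure: it treats a sequence as stopping at the first filter that fails, multiplies success probabilities along the sequence, and takes p_T as 0.749 (the value consistent with the reported r_T). Under these assumptions it gives about 1.33 for F to TF and about 29.6 for F to CF, close to the quoted 29.7.

```python
def expected_cost(sequence, cost, prob):
    """Expected per-item cost of running filters in order, stopping at the first failure."""
    total = 0.0
    reach = 1.0  # probability that an item gets as far as the current filter
    for name in sequence:
        total += reach * cost[name]
        reach *= prob[name]
    return total


COST = {"C": 0.0102, "T": 0.000236, "F": 3.11}
PROB = {"C": 0.0305, "T": 0.749, "F": 0.421}  # p_T is an assumed value

f_over_tf = expected_cost("F", COST, PROB) / expected_cost("TF", COST, PROB)
f_over_cf = expected_cost("F", COST, PROB) / expected_cost("CF", COST, PROB)
print(round(f_over_tf, 2), round(f_over_cf, 1))
```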

5.2. Scaling up the database

The MARIE system was just a prototype implementation that handled a random sample of 1/166 of the entire
NAWC database, which at the time of the sample was 36,000 captioned data items. We can use the preceding
analysis to predict the optimal execution plan when MARIE is applied to the full database.

The total time to do fine-grain matching and picture-type matching should increase by a factor of 166, since each
match is independent and there are no economies of scale. Registration-data testing will be close to 166 times slower
because it is best implemented with indexes, and the indexes will be 166 times longer on the average.
Coarse-grain matching will be dominated by the intersection of lists 166 times longer on the average, so it will also be
close to 166 times slower. Only the parser will remain nearly the same speed, since the grammar it handles is
nearly a complete grammar for the remaining captions, and most of the parse time is in fetch from a hashed
lexicon, backtracking among grammar choices, and reasoning about lexicon information for words, activities not
affected by the size of the database.

That means the effective time cost for the parser is reduced to 3.76/36000 = 0.000104 per data item. This still
rules out the sequence TCRF, which was optimal without considering the parser, but still allows the sequences
CRTF, CTRF, RCTF, RTCF, CTFR, RTF, CRF, RCF, CFR, RF, and FR. Deletion optimality is not affected by
the inclusion of the parser, so we need only consider the four-filter sequences CRTF, CTRF, RCTF, RTCF, and
CTFR. The optimal solution when a single inequality constraint is imposed upon a problem is one in which the
inequality constraint is "active", or at the border of infeasibility, which means T must be second in the sequence,
leaving only CTRF, RTCF, and CTFR as possibilities. But in the second of those R precedes C, and in the third
of those F precedes R, both of which are inconsistent with interchange optimality. Hence sequence CTRF is the
optimal one, with the parser in parallel with C. As before, there is no benefit to putting any of the filters in
parallel with different filters, but there is an advantage in data-partition parallelism on each filter. So the
optimal execution plan with N available processors is to put 1 processor on the parser in parallel with N-1 processors
on C, then N processors on the sequence TRF on different random partitions of the database.
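
The elimination argument above can be replayed by brute-force enumeration. This Python sketch is only an illustration: the three feasibility constraints are assumptions chosen so that the enumeration reproduces the five four-filter sequences listed in the text (in particular, requiring C before F is our reading of why RTFC is absent), and the ratios are the measured ones from the prototype.

```python
from itertools import permutations

# Cost-probability ratios of the filters other than T, from the prototype measurements.
RATIO = {"C": 0.0105, "R": 0.0144, "F": 5.37}


def feasible(seq):
    """Constraints assumed here to reproduce the five sequences listed in the text."""
    return (
        seq.index("T") < seq.index("F")      # T must precede F (dependency)
        and seq[0] != "T"                    # T has to wait for the parser
        and seq.index("C") < seq.index("F")  # assumption: C is pointless after F
    )


def interchange_ok(seq):
    """C, R and F must appear in order of increasing ratio."""
    others = [f for f in seq if f in RATIO]
    return others == sorted(others, key=RATIO.get)


candidates = ["".join(p) for p in permutations("CRTF") if feasible("".join(p))]
active = [s for s in candidates if s[1] == "T"]  # parser constraint active: T second
optimal = [s for s in active if interchange_ok(s)]
print(sorted(candidates), sorted(active), optimal)
```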

5.3. Other modifications of MARIE

Our  approach  also  permits  straightforward  analysis  of  several  interesting  hypothetical  modifications.  If  the 
parser  finds  natural-language  input  to  be  ambiguous,  alternative  meaning  lists  can  be  generated.  Then  the  T  and 
F  processors  can  execute  a  disjunction  of  the  alternatives,  using  the  methods  of  section  4. 

Another idea is to split the coarse-grain filter into separate filters for each noun of the query. The current
implementation uses a single heap structure for all input nouns, but a simpler implementation would load and intersect
index files, finding captions belonging to each of a set of index files (assuming an exact match to the query is
desired). Then large index files have both a high cost and a high probability of success, hence a high ratio
c/(1-p). Hence if we partition the coarse-grain matching into separate filters for each noun, the filters should be
sorted by increasing frequency of noun occurrence. This may mean that other filters like R and T can now be
interleaved with coarse-grain filters, and placed before filters for high-frequency nouns (like "view" and "test" at
NAWC-WD).
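
A per-noun version of coarse-grain matching might look like the following Python sketch. It is a hypothetical illustration, not the MARIE implementation: the index is modeled as a dictionary from nouns to sets of caption IDs, the toy data is invented, and index size stands in for noun frequency.

```python
def match_nouns(index, nouns):
    """Intersect per-noun caption-ID sets, processing the rarest noun first."""
    ordered = sorted(nouns, key=lambda n: len(index.get(n, ())))
    result = None
    for noun in ordered:
        ids = index.get(noun, set())
        result = set(ids) if result is None else result & ids
        if not result:  # nothing can match, so skip the larger index files
            break
    return ordered, (result or set())


# Hypothetical toy index: noun -> IDs of the captions that contain it.
INDEX = {
    "view": {1, 2, 3, 4, 5, 6},
    "test": {2, 3, 5, 6},
    "missile": {3, 6},
}

order, hits = match_nouns(INDEX, ["view", "missile", "test"])
print(order, sorted(hits))
```

Processing the small index files first keeps the running intersection small and lets the loop stop early when it becomes empty.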

Yet another idea is to treat the user as another filter, as we suggested earlier. The user will examine data items
supplied, and will accept some of them. We can assume that any user knows better what they want than any
automatic filter, so this user "filter" U entails all the others discussed; but to maintain a tree structure for
entailments, U must only directly entail fine-grain matching (F). Since F must go last without this user filter, U when
present must be last. The only question then is deletion optimality of the filters. The deletion of C or T cannot
affect U because they are "buffered" by F. The deletion of R is pragmatically impossible to affect U because
r_R = 0.0144 in the current MARIE implementation, which is far less than the time it would take a human to
assess the relevance of a picture. The deletion of F is only desirable if 5.37/N > c_U, where N is the number of processors.
Thus if time of the user to confirm answers is the only cost criterion, we should only delete F if the user
averages less than 5.37/N seconds to assess a picture; if other criteria were included in the cost, like bother to the
user, this threshold would decrease. If F should be deleted, then we can next consider deleting other filters, but
their values of the ratio r are so much lower that it is pragmatically impossible for this to be desirable.
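
The deletion test for F is easy to state in code. The following Python sketch is illustrative; the function name and the one-second assessment time are hypothetical, and only the ratio 5.37 and the form of the criterion come from the text.

```python
R_F = 5.37  # cost-probability ratio of the fine-grain matcher F, in seconds


def should_delete_f(user_seconds_per_item, n_processors, r_f=R_F):
    """Apply the criterion r_F / N > c_U, with c_U the user's time per item."""
    return r_f / n_processors > user_seconds_per_item


# Hypothetical user who needs 1.0 second to assess a picture.
for n in (1, 4, 16):
    print(n, round(R_F / n, 3), should_delete_f(1.0, n))
```

As more processors are devoted to F, the threshold 5.37/N shrinks, so deleting F becomes harder to justify.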

5.4. Further capabilities with a mixed query language

Beyond the capabilities just described, we have implemented for MARIE an enhanced query capability in an
SQL-like format, allowing arbitrary nesting of boolean expressions, including possibly multiple natural-language
strings and multiple registration-data restrictions. The methods of sections 2 and 4 can be applied to these
enhanced queries. Our query language adds an additional comparator "MATCHES" to SQL, to initiate
natural-language processing and semantic matching. It does not handle joins, however, since we believe that good
captions and a good type hierarchy can eliminate most need for them in multimedia databases. The availability of
such a "mixed" query language with both conventional SQL and natural-language-descriptor features means that
the natural-language processing does not need to handle complicated scoping rules for quantifiers like "not",
"or", and "all", which can be very tricky to analyze in English, since the user can express such distinctions with
the formal part of the query language. Programmers can use our modified SQL directly, but we also provide a
graphical interface for naive users that permits structured query formulation.

6.  Conclusions 

We have explored a new approach to information retrieval, the concept of information filtering. Our approach
has focussed on the system aspects of filtering rather than the details of the filters, and our work should nicely
complement the results on filter design provided by classic information-retrieval methods and work on
signature-based retrieval. Our approach has been mostly analytical, providing local optimality criteria for filter
execution plans. Thus it contrasts with work on query optimization for database systems, for which the search
space is so difficult to analyze that methods must be either exhaustive or heuristic; our local optimality criteria
are considerably more sophisticated than the "cheapest-first" heuristic often used there. Information filtering thus
appears to be a special case of general database retrieval that has special exploitable properties for improving
efficiency. The methods we have proposed will be particularly useful for the design of multimedia
information-retrieval systems, for which conceptually distinct filters can easily be derived.

[Figure 1 listing: four filter(...) facts with their cost-probability ratios, followed by the costs of the filter
sequences examined (two of them marked as local optima) and the summary lines "The number of local optima: 2"
and "Heuristic optimum: [2,4,3]".]

Figure 1: Example output of the conjunctive-sequence tester

no. of    prob.      prob.       size of space           number of optima          heuristic
filters   entailed   entailing                                                     cost ratio

3         0.2        0.2         0.0554518 (0.0188046)   0.0138629 (0.00970406)    0.0 (0.0)
4         0.2        0.2         0.152492 (0.0303405)    0.00693147 (0.00689673)   0.0 (0.0)
5         0.2        0.2         0.263396 (0.0446852)    0.038712 (0.0172595)      0.0 (0.0)
6         0.2        0.2         0.505997 (0.0641714)    0.0762462 (0.0216879)     0.0 (0.0)
7         0.2        0.2         0.526792 (0.0651116)    0.112081 (0.0272095)      0.0 (0.0)
8         0.2        0.2         0.672353 (0.0749465)    0.114958 (0.0266312)      0.0 (0.0)
9         0.2        0.2         0.977338 (0.0930445)    0.156547 (0.0322006)      0.0 (0.0)
10        0.2        0.2         1.21301 (0.113208)      0.168711 (0.0340448)      0.0 (0.0)
11        0.2        0.2         1.2338 (0.116638)       0.238026 (0.039367)       0.0 (0.0)
12        0.2        0.2         1.45561 (0.133149)      0.335557 (0.0417515)      0.0145623 (0.0144893)
13        0.2        0.2         2.07251 (0.145889)      0.343708 (0.0472216)      0.0 (0.0)
14        0.2        0.2         2.25273 (0.147813)      0.35064 (0.0516756)       0.0 (0.0)
15        0.2        0.2         2.12796 (0.161744)      0.327977 (0.054881)       0.0125881 (0.00800236)
16        0.2        0.2         2.74486 (0.170891)      0.471194 (0.0614881)      0.0151046 (0.0142948)
17        0.2        0.2         3.14985 (0.179684)      0.478125 (0.0579669)      0.0008991 (0.0006283)
18        0.2        0.2         3.39642 (0.185603)      0.622168 (0.0740848)      0.0008681 (0.0008096)
19        0.2        0.2         4.07571 (0.312578)      0.832629 (0.0828184)      0.0521339 (0.0289764)
20        0.2        0.2         4.19354 (0.199061)      0.695475 (0.0637804)      0.0360905 (0.0264117)

3         0.2        0.8         0.377259 (0.0449211)    0.00693147 (0.00689673)   0.0 (0.0)
4         0.2        0.8         0.347586 (0.0557068)    0.0207944 (0.0118242)     0.0 (0.0)
5         0.2        0.8         0.998132 (0.065259)     0.103972 (0.0247503)      0.0 (0.0)
6         0.2        0.8         1.36846 (0.0832094)     0.128821 (0.0294552)      0.0 (0.0)
7         0.2        0.8         1.85763 (0.0887012)     0.353066 (0.0391569)      0.0120575 (0.00854089)
8         0.2        0.8         2.38443 (0.10481)       0.360936 (0.0487895)      0.0130564 (0.012991)
9         0.2        0.8         2.80725 (0.0986974)     0.46442 (0.0569815)       0.0332153 (0.0252187)
10        0.2        0.8         3.16768 (0.10009)       0.550636 (0.056419)       0.0347772 (0.0245527)
11        0.2        0.8         4.08957 (0.108937)      0.619519 (0.0599677)      0.0317728 (0.0316135)
12        0.2        0.8         4.42921 (0.113004)      0.708135 (0.0682391)      0.018403 (0.016857)
13        0.2        0.8         5.15702 (0.114451)      0.664393 (0.068758)       0.0346816 (0.0187861)
14        0.2        0.8         5.85709 (0.100626)      0.78386 (0.0816895)       0.0176791 (0.0119071)
15        0.2        0.8         6.40468 (0.1131)        0.89792 (0.0758427)       0.0402969 (0.0223391)

no. of    prob.      prob.       size of space           number of optima          heuristic
filters   entailed   entailing                                                     cost ratio

3         0.8        0.2         0.270327 (0.448086)     0.0277259 (0.0135829)     0.0 (0.0)
4         0.8        0.2         0.60997 (0.681121)      0.0277259 (0.0135829)     0.0 (0.0)
5         0.8        0.2         0.89416 (0.971866)      0.0415888 (0.0164613)     0.0 (0.0)
6         0.8        0.2         1.44175 (0.119125)      0.0855333 (0.0264176)     0.0 (0.0)
7         0.8        0.2         1.6061 (0.144361)       0.157725 (0.0355865)      0.0 (0.0)
8         0.8        0.2         2.18341 (0.170456)      0.122422 (0.0320054)      0.0135923 (0.0132327)
9         0.8        0.2         3.02212 (0.174673)      0.156026 (0.0367928)      0.00639268 (0.00636064)
10        0.8        0.2         3.37563 (0.190245)      0.194092 (0.0383726)      0.00763135 (0.00753294)
11        0.8        0.2         3.81924 (0.219739)      0.262072 (0.0491324)      0.0782128 (0.0489747)
12        0.8        0.2         4.92134 (0.222132)      0.251086 (0.0458398)      0.049436 (0.0312112)
13        0.8        0.2         5.25211 (0.233519)      0.229114 (0.0449624)      0.0930229 (0.039692)
14        0.8        0.2         5.46893 (0.269864)      0.326485 (0.0540142)      0.0569272 (0.0359603)
15        0.8        0.2         6.01652 (0.261727)      0.362034 (0.0564617)      0.0817849 (0.0308793)

3         0.8        0.8         1.04665 (0.0432815)     0.0207944 (0.0118242)     0.0 (0.0)
4         0.8        0.8         1.6081 (0.0562091)      0.0762462 (0.0216879)     0.0 (0.0)
5         0.8        0.8         2.39829 (0.0549469)     0.0733694 (0.0223444)     0.0 (0.0)
6         0.8        0.8         3.02212 (0.0567535)     0.0872323 (0.0239395)     0.0219433 (0.0218333)
7         0.8        0.8         3.80538 (0.0567323)     0.15367 (0.0318464)       0.0616014 (0.02526)
8         0.8        0.8         4.47773 (0.0513303)     0.180875 (0.0390725)      0.0680312 (0.0468534)
9         0.8        0.8         5.1986 (0.0541365)      0.162188 (0.0372266)      0.0712856 (0.03593)
10        0.8        0.8         5.8363 (0.0538518)      0.193968 (0.0404172)      0.0648062 (0.0241476)
11        0.8        0.8         6.63342 (0.0482771)     0.242613 (0.0448573)      0.118454 (0.0395468)
12        0.8        0.8         7.30577 (0.0549469)     0.222182 (0.0436092)      0.164913 (0.052674)

Class of logical                  Optimal independent of       Satisfied by a greedy
equivalences                      costs and probabilities?     (nonbacktracking) algorithm?

Conjunctive commutativity         no                           yes
Conjunctive elimination of        no                           no
  entailed filters
Disjunctive commutativity         no                           yes
Disjunctive elimination of        no                           no
  entailed filters
Various redundancy                yes                          yes
  eliminations
Distributivity factoring          yes                          yes
DeMorgan's Laws inward            yes                          no

Figure 12: Summary of the optimality status of the standard boolean manipulations

[Figure 13 diagram; node labels: natural-language parser of query; registration-data tester (no joins);
coarse-grain matcher; picture-type checker; fine-grain semantic matcher; content matcher, either by user
or by computer (optional).]

Figure 13: Dependencies between filters in the MARIE system

Figure 14: Optimal execution plan for the MARIE system, following the analysis in the text